Exploratory analysis and price prediction of Airbnb short-stay rental data: a case study of Beijing

Programming Tools for Urban Analytics, University of Glasgow

Student ID: 2516202
29 March 2021

1. Introduction

Against the background of mass tourism, the term "short staying" has become popular with the development of e-commerce tourism in China. Airbnb, an online platform for short-term home and apartment rentals, officially entered the Chinese home-sharing market in March 2017 (Airbnb, 2019). China is one of the biggest markets for Airbnb in the world and could be its largest potential market by 2021 (Zhang and Fu, 2020). It is therefore necessary to understand consumers' rental experience and preferences in China's sharing economy.

From the perspective of internet big data, short-stay rental is becoming a bridge and a driving force in the evolution of the tourism accommodation industry. One challenge that Airbnb landlords face is determining the appropriate rental price. In many areas, renters are presented with a good selection of listings and can filter by criteria such as price, number of bedrooms, room type, rental period and so on. Since Airbnb is an online market, the price a landlord sets is ultimately tied to the market price. Thus, this report aims to explore Airbnb guests' rental experience and preferences in Beijing, using exploratory data analysis (EDA) and machine learning to analyse short-stay rental prices and discuss feature importance.

This data science report mainly contains: 1) an exploration of rental price, geographical location, reviews and other data across the districts of Beijing, and 2) rental price prediction using Lasso, Ridge, ElasticNet, and Linear Regression models.

2. Data

Airbnb does not release any data from its website, but Inside Airbnb, a watchdog website launched by Murray Cox in 2016, scrapes and reports data on the property rental marketplace company Airbnb. For this report, the data set was scraped on 22 February 2021 for the city of Beijing, China (Source: http://insideairbnb.com/get-the-data.html). It contains information on all Beijing Airbnb listings.

2.1 Data overview

This section processes the 'listings_detailed' data set and checks the data information for the subsequent EDA and machine learning sections. From the data head, we can see that the data set mostly contains numeric variables such as price, longitude, latitude, number of reviews etc. However, some variables, such as property type and neighbourhood_cleansed, are text variables. There are 74 variables (columns) and 24,977 observations in total in the listings_detailed data set.

2.2 Data processing

The data set used for this report is web-scraped data from Inside Airbnb, so it may contain variables that are irrelevant to price prediction. In the detailed data set, free-text columns were dropped, as well as other features not useful for predicting price (e.g. host_url, host_about, and other host-related features that are unrelated to the property). There were also near-duplicate columns for minimum and maximum night stays, with few differences between minimum_maximum_nights, minimum_minimum_nights etc. In addition, descriptive variables such as neighbourhood_overview (an overview of the surroundings of the listing's address) can be represented by latitude and longitude, as the heat map in Section 3.1.5 below shows. Other variables consisting of NA values, such as license, neighbourhood_group_cleansed, bathrooms, calendar_updated etc., can be discarded as well. Furthermore, because all of the listings are located in Beijing, the host_location column was dropped. After this pre-processing, 36 variables remain for exploratory data analysis.

Renaming unreadable values

We notice that neighbourhood_cleansed in both the neighbour and detailed data sets contains Chinese names, and in some cases a mix of Chinese and English. In addition, every price value carries a '$' sign. We therefore rename each district in both data sets and remove the '$' sign.
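The renaming and price clean-up described above could look roughly like the following sketch. The raw district labels, the mapping table, and the helper names clean_price and clean_district are hypothetical illustrations, not the code used for this report; the real column values may be formatted differently.

```python
# Hypothetical raw labels: the actual neighbourhood_cleansed values may differ.
DISTRICT_NAMES = {
    "朝阳区 / Chaoyang": "Chaoyang",
    "东城区": "Dongcheng",
    "海淀区": "Haidian",
}


def clean_price(raw: str) -> float:
    """Turn a price string such as '$1,250.00' into the float 1250.0."""
    return float(raw.replace("$", "").replace(",", "").strip())


def clean_district(raw: str) -> str:
    """Map a raw district label to a plain English name.

    Labels that are not in the mapping are returned unchanged (stripped).
    """
    label = raw.strip()
    return DISTRICT_NAMES.get(label, label)


# Two made-up rows standing in for the listings table.
rows = [
    {"neighbourhood_cleansed": "朝阳区 / Chaoyang", "price": "$1,250.00"},
    {"neighbourhood_cleansed": "东城区", "price": "$380.00"},
]

cleaned = [
    {
        "district": clean_district(row["neighbourhood_cleansed"]),
        "price": clean_price(row["price"]),
    }
    for row in rows
]
```

Keeping the mapping in one dictionary means the same renaming can be applied to both the neighbour and the detailed data sets.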

Data filtering

The highest price for a one-night rental is about 1 billion USD and the lowest is 0 USD. Evidently, these values are absurd and are outliers. In addition, the average rental price is around 41,000 USD per night, whereas the average monthly disposable income in Beijing is around ¥6,147 (around 961 USD) (Source: Beijing Municipal Bureau of Statistics). Thus, we reject the 41,000 USD average as unrealistic and set 961 USD as the upper bound for the short-stay rental price in Beijing.
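As a minimal sketch of this filtering rule, the snippet below keeps only prices greater than 0 and no greater than 961 USD. The sample prices and the helper name keep_listing are made up for illustration; only the 961 USD cap comes from the text.

```python
PRICE_CAP = 961.0  # upper bound in USD, taken from the text


def keep_listing(price: float, cap: float = PRICE_CAP) -> bool:
    """Keep a listing only if its nightly price is above 0 and at most the cap."""
    return 0 < price <= cap


# Made-up prices, including the kinds of outliers mentioned in the text.
prices = [0.0, 85.0, 961.0, 41_000.0, 1_000_000_000.0]
filtered = [p for p in prices if keep_listing(p)]
```

The lower bound is exclusive and the upper bound inclusive, so a listing priced at exactly 961 USD is retained while free listings are dropped.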

3. Methods

3.1 Exploratory Data Analysis (EDA)

Exploratory Data Analysis profiles the data set, presents the relationship between each feature and the target variable, and explores the data set across different dimensions using visualisations such as graphs and plots (Hoaglin and Tukey, 2000). Here we use the listings_detailed and neighbour data sets prepared in Section 2 above. This section covers five aspects: 1) Rental distribution; 2) Rental amenities; 3) Room type; 4) Review mining; and 5) Price distribution. Some particular features used for EDA, and their explanations, are as follows:

In addition, for the review analysis, we add the reviews_detailed data set to mine tenants' comments and examine their experience of short-stay rentals.

3.1.1 Rental distribution

The pie chart illustrates the distribution of Airbnb rentals across the districts of Beijing. Overall, over 55% of the rentals are concentrated in Chaoyang, Fengtai, Haidian, and Xicheng Districts, which suggests that demand there is higher than in other areas. The remaining districts have fairly even shares (approximately 5%-6% each), except Pinggu and Mentougou, which account for only about 1% each.

Fig. 2 presents the overall geographical distribution of rental housing in Beijing by latitude and longitude. In the location plot, we can see that the lower central areas (Fengtai, Xicheng, Daxing) contain more housing. This is understandable: the two central districts of Beijing, Xicheng (West) and Dongcheng (East), are the historic districts and, together with three districts near the ring roads (Haidian, Chaoyang, Fengtai), are where rentals can be dense (Li and Biljecki, 2019). Pinggu and Mentougou, by contrast, are outlying districts far from the centre, where the population is more dispersed.

3.1.2 Rental amenities Wordle

The amenities Wordle (word cloud) illustrates the basic and entertainment facilities and furniture in short-stay rentals. Overall, 'Wifi', 'Elevator', 'Dedicated workspace', 'Free parking on premises', and 'Fire extinguisher' appear most frequently, which suggests that most tenants place emphasis on the rental environment, safety, and the ability to work. In addition, short-stay rentals also offer long-term stays, which means they can be an alternative or temporary choice for tenants who are looking for a long-term rental.
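A Wordle of this kind is driven by how often each amenity occurs across listings. The sketch below counts amenity frequencies, assuming each cell of the amenities column is a JSON-style list string. That format, the sample cells, and the helper name amenity_counts are assumptions for illustration only.

```python
import json
from collections import Counter

# Hypothetical 'amenities' cells, assumed to be JSON-style list strings.
amenities_column = [
    '["Wifi", "Elevator", "Fire extinguisher"]',
    '["Wifi", "Dedicated workspace"]',
    '["Wifi", "Elevator", "Free parking on premises"]',
]


def amenity_counts(cells):
    """Count how many listings mention each amenity."""
    counts = Counter()
    for cell in cells:
        counts.update(json.loads(cell))  # one increment per amenity in the cell
    return counts


counts = amenity_counts(amenities_column)
top_two = counts.most_common(2)  # the two most frequent amenities
```

The resulting frequency table is what a word cloud tool would scale its word sizes by.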

3.1.3 Room type

Number of room type

There are three room types in Airbnb rentals: Entire room, Private room, and Shared room. An entire room gives tenants independent use of the whole living space, which suits families or consumers with high living requirements, but it is the most expensive type. The bar chart illustrates that Entire room (about 11,800) is the dominant and mainstream type, with around 4,000 more listings than Private room. In contrast, there are few shared rooms (about 1,000) in Beijing, which suggests that demand for them is lower. Most of these rooms are in inns and dormitories, which are cheaper than the other types and suitable for backpackers or individual travellers.

Room type number in each district

At district level, most areas contain higher numbers of Entire room, except Changping (about 43%), Dongcheng (about 35%), Pinggu (about 43%), Huairou (about 20%), and Yanqing (about 38%). Shared room has the lowest share in every district, accounting for less than about 10% in each. Notably, the three districts with the most rentals, Chaoyang, Dongcheng, and Fengtai, have different room type distributions. Interestingly, among these areas, Fengtai contains the most Entire rooms, Dongcheng the most Private rooms, and Chaoyang the highest number of Shared rooms, which suggests that tenants' room preferences in these areas are diverse.

3.1.4 Reviews mining

Reviews and price

The scatter plot illustrates the relationship between price and number of reviews. Overall, the price range of listings with fewer reviews is wider than that of listings with more reviews. For listings with 0 reviews, prices are spread from 0 USD (exclusive) to 961 USD (inclusive). On the right side of the plot there is a single listing with over 400 reviews, priced at around 600 USD. In other words, the more reviews a rental has, the more stable its price tends to be.

For the analysis of review counts, we assume that a higher number of comments indicates a popular, highly rated rental. Thus, we label listings whose number of comments is above the third quartile (Q3, i.e. higher than 75% of all listings) as 'good rentals' and those whose number of comments is below the first quartile (Q1, i.e. lower than 25% of all listings) as 'not satisfied rentals'. Compared with the price description for the total sample, the price-filtered listings_detailed data set (Section 2.2) makes sense. The overall average rental price is 380 USD, while the average price is 357 USD for satisfied rentals and 393 USD for unsatisfied rentals. That is, the price of rentals people are satisfied with is close to, but lower than, the overall average.
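The quartile-based grouping can be sketched as follows. The (review count, price) pairs are made up for illustration and do not reproduce the averages reported above; the interpolation method used for the quartiles is also an assumption.

```python
from statistics import mean, quantiles

# Made-up (number_of_reviews, price in USD) pairs, not the real data.
listings = [
    (0, 500.0), (1, 420.0), (3, 390.0), (8, 360.0),
    (20, 350.0), (45, 340.0), (120, 330.0), (400, 600.0),
]

review_counts = [n for n, _ in listings]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, _, q3 = quantiles(review_counts, n=4, method="inclusive")

# 'Good' rentals have more reviews than Q3; 'not satisfied' ones fewer than Q1.
good = [price for n, price in listings if n > q3]
not_satisfied = [price for n, price in listings if n < q1]

mean_all = mean(price for _, price in listings)
mean_good = mean(good)
mean_not_satisfied = mean(not_satisfied)
```

Comparing mean_good and mean_not_satisfied with mean_all gives the kind of three-way price comparison described in the paragraph above.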

Comments Wordle

With the development of technology, social media hosts a great amount of user-generated content (UGC) on a wide range of platforms such as vlogs and blogs. UGC is widely considered an important resource for discovering user experience and preferences, which can serve as a reference for understanding user behaviour (Zhang and Fu, 2020). In this section, we use the reviews_detailed data set to build a Wordle and explore the experiences that Airbnb renters mention most frequently.

The Jieba package is used for segmenting Chinese text into words, and jieba.cut(s) segments the sentence s accurately. The Wordle above presents people's comments on short-stay rentals. Overall, most comments are positive, such as '很好' (good, very satisfied, nice landlord). In particular, '干净整洁' (clean and tidy room), '位置方便' (convenient location, subway nearby), and '安静' (quiet) are the most frequent comments. This suggests that rentals with good public transport access and well-kept, good-quality rooms are popular with Beijing Airbnb users.

3.1.5 Price distribution

Price bar chart and boxplot

Fig. 6 illustrates the average Airbnb rental price in each district, ordered from highest to lowest. Xicheng has the highest average rental price (about 510 USD per night), followed by Huairou, Pinggu, Miyun, and Mentougou, at about 450 USD each. Yanqing and Changping have the lowest prices, at about 300 USD. Interestingly, the average rental price in Chaoyang and Fengtai, two of the three most popular districts, is about 150 USD lower than in Xicheng. Similarly, in some city-centre districts such as Dongcheng (380 USD), rental prices are lower than in outlying districts such as Shunyi (about 420 USD), Fangshan (about 410 USD), Miyun and so on.

Fig. 7 may help to explain these interesting price differences. Overall, the highest rental prices in some outlying districts, such as Huairou, Miyun, Pinggu, and Yanqing, are nearly 961 USD. The lowest prices in those areas are all around 270 USD to 310 USD. Furthermore, most districts' median price lies between 300 USD and 400 USD, except Huairou, Miyun, Pinggu, and Yanqing, whose medians are all higher than 400 USD. Besides, the 75th percentile of rental prices in those areas is about 640 USD, which can explain their higher average rental prices. Among the three most popular districts, Dongcheng has the highest price (around 940 USD), and its 75th percentile is around 570 USD. It is followed by Chaoyang, where the 75th percentile is around 500 USD and the lowest price is around 210 USD.

Price Map

The price heat map shows the price distribution intuitively. As in the rental location plot (Fig. 2), the districts inside the ring roads, Chaoyang and Dongcheng, have more rentals and higher prices than other areas. In particular, prices around the Central Radio & Television Tower, Sanlitun and the Beijing Workers' Stadium, situated in Chaoyang District and surrounded by subway lines 10, 8 and 6, are the most concentrated and the highest. Similarly, Gulou Street (Dongcheng District), Qianmen (next to the Forbidden City, Xicheng), and the university cluster (including Tsinghua, Peking University etc., Haidian) also show dense and higher prices. It is easy to see that CBD areas, tertiary institutions, and tourist areas can be renters' first choice for short-stay housing.

3.2 Machine learning

3.2.1 Data Cleaning
3.2.3 Correlation test

Multicollinearity is a state of very high intercorrelation among the independent variables. The correlation test is used to check for multicollinearity between variables. Fig. 4 shows that there is a strong correlation between 'availability_30', 'availability_60', 'availability_90', and 'availability_365' (around 0.8). Since the 365-day availability window contains the other three, we keep only 'availability_365' for the machine learning model. Similarly, review_scores_rating, review_scores_checkin, and review_scores_location are highly correlated (around 0.75), and the overall review score can reflect the check-in score. Likewise, calculated_host_listings_count contains the private room, entire room, and shared room counts, which amounts to a perfect linear relation between those variables.

The heat map (Fig. 8) shows a positive correlation between latitude and rental price, which means that rentals in the north are more expensive than those in the south. We then place 'price' in the first row and check the correlations among the remaining independent variables. Fig. 9 shows that the remaining variables all have low correlations, the highest being 0.42, so these variables are appropriate for regression modelling.
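To illustrate the kind of check described here, the sketch below computes Pearson correlations between columns and flags the pairs whose absolute correlation reaches a threshold. The column values, the 0.75 threshold, and the helper names are illustrative assumptions rather than the report's actual procedure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold):
    """Return the column-name pairs whose absolute correlation is >= threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Made-up values for three columns named in the text.
columns = {
    "availability_30": [0, 5, 12, 30, 28],
    "availability_365": [10, 80, 150, 365, 300],
    "latitude": [39.9, 40.1, 39.8, 40.0, 39.95],
}

pairs = correlated_pairs(columns, threshold=0.75)
```

For each flagged pair, one of the two variables would then be dropped, as is done with the shorter availability windows in the text.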

3.2.4 Normality check

'price' is the target variable (or dependent variable) for machine learning. We check its normality. After a log() transformation, the skew of the price distribution is reduced and the data appear more normally distributed (Fig. 10).
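The effect of the log transformation on skewness can be illustrated with a small sketch. The prices below are made up, and the moment-based skewness formula is one common definition, not necessarily the one used for Fig. 10.

```python
from math import log
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the mean of the cubed standardised deviations."""
    centre = mean(values)
    spread = pstdev(values)
    return sum(((v - centre) / spread) ** 3 for v in values) / len(values)


# Made-up right-skewed prices in USD, within the 0-961 range used in the report.
prices = [60, 80, 100, 120, 150, 200, 300, 450, 700, 961]
log_prices = [log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
```

For right-skewed data such as these, skew_log is closer to zero than skew_raw, which is the behaviour the normality check relies on.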

3.2.5 Feature engineering
Data transformation

We encode categorical variables as indicator (dummy) vectors. Taking 'neighbourhood_cleansed' as an example, 0 means the listing is not in that neighbourhood and 1 means it is. Then, we split the data into train and test sets using the sklearn.model_selection package. The final data set contains 19,887 observations and 186 processed (transformed) independent variables; the train set contains 14,915 observations and 185 independent variables, and the test set contains 4,972 observations and 185 independent variables. We choose the median price (0.55 USD) as the baseline, which serves as a reference for checking model fit.
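The encoding and splitting steps can be sketched in plain Python as below. This is not the sklearn code used for the report. The 25% test fraction is an assumption (it is consistent with the reported 14,915 / 4,972 split of 19,887 observations), and the helper names, column prefix, and seed are hypothetical.

```python
import math
import random


def one_hot(values, prefix="neighbourhood_cleansed"):
    """Encode each category as 0/1 indicator columns, one column per category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


def split_indices(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # test size is rounded up
    return indices[n_test:], indices[:n_test]


encoded = one_hot(["Chaoyang", "Dongcheng", "Chaoyang"])
train_idx, test_idx = split_indices(19_887)
```

With 19,887 rows and a 25% test fraction, the test set has 4,972 rows and the train set 14,915, matching the sizes reported above.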

Baseline
3.2.6 Modelling

To model the relationship between rental prices and property proximity to certain venues based on Airbnb data, we use Multiple Linear Regression, Lasso Regression, Ridge Regression, and ElasticNet models for price prediction. We try a range of alpha values and l1 ratios for each model to obtain the best results.

RMSE (root mean squared error) is the square root of the mean of the squared errors in the model's predictions and indicates the model's predictive capability. Comparing the train and test errors shows how well a model generalises to data that were not used for training. From the model results, the baseline RMSE comes out at approximately 0.559. For the Linear Regression model, the train error is around ± 0.422 USD, whereas the test error is far worse (over 90 billion). This suggests that LR overflows and cannot handle many of the variables. For the Ridge Regression model, the outcomes make more sense: the errors are about 0.423 and 0.429 for train and test respectively, and the best alpha is 1000. The ElasticNet model is similar, with errors of around 0.424 and 0.429 respectively; the best alpha is 0.01 and the best l1 ratio is 0.1. For Lasso, the RMSE is the closest to the baseline (0.559), with train and test RMSE of 0.433 and 0.436 respectively. Besides, the Lasso Regression model removed more irrelevant variables, so we conclude that Lasso performs the best among the models.
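For reference, RMSE and a median baseline can be computed as in the following sketch. The target values are made up, and the assumption that the baseline predicts the training median for every test row is an illustration of the idea rather than the report's exact code.

```python
from math import sqrt
from statistics import median


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up target values (for example, log prices) for a tiny train/test split.
y_train = [5.0, 5.5, 6.0, 6.5]
y_test = [5.2, 6.1, 5.8]

# Baseline model: always predict the median of the training targets.
baseline_prediction = median(y_train)
baseline_rmse = rmse(y_test, [baseline_prediction] * len(y_test))
```

A fitted model is only useful if its test RMSE falls below baseline_rmse, which is the comparison made against the 0.559 baseline above.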

3.2.7 Error analysis

Error analysis presents the difference between the estimated and actual sample values. We choose the Lasso regression model for the error analysis. Fig. 12 shows the error distribution of the Elastic model. Most estimation errors are around 0; however, several inaccurate predictions still exceed the true values by 1 to 2, which may be what makes the average error large.


4. Summary

4.1 Main findings

This report presents an analysis of Airbnb data for Beijing (China), with a focus on exploratory data analysis of locations, room types, and price. Finally, we employ Lasso, ElasticNet, Ridge, and Linear Regression methods for price prediction.

The EDA reveals how price varies with rental location, number of reviews, and room type, and raises the question of whether prices are driven by the same features. The results show that the city-centre areas, Dongcheng, Fengtai, Chaoyang, and Xicheng, have the largest numbers of rentals and concentrated prices. The room type analysis suggests that most people prefer an entire room/apartment overall (Voltes-Dorta and Sánchez-Medina, 2020), while some city-centre tenants choose private rooms. Moreover, the Wordles of amenities and comments show that Airbnb renters prefer rentals with good transport connections and rooms with high living quality. Finally, high rental prices are concentrated in tourist areas, university areas, and some CBD regions.

The machine learning section presents an analysis of the factors influencing price and the fit of the prediction models. In terms of the relevant factors, accommodation and latitude have stronger positive relations with price, whereas the minimum available nights and number of reviews show a negative influence. This means that the more reviews and nights a rental has, the lower its price tends to be; in other words, it can be more difficult for a landlord to set an appropriate price. The modelling section uses Lasso, Ridge, LR, and ElasticNet models for price prediction, and the results show that Lasso Regression has the best performance (with an RMSE of 0.436), which can be explained by its omission of some irrelevant variables.

4.2 Suggestions

Overall, the models and EDA yield reasonable results for price and each key feature, and most findings are consistent with previous research. However, this report still has some research gaps. Firstly, some outlying areas, such as Pinggu and Huairou, have the highest rental prices; there is no clear evidence that explains these exorbitant prices, and this needs further exploration. Secondly, the data set is web-scraped and has many limitations: it may contain much unusable information. In this case, we dropped over 45 of the 74 variables, of which over 20 were NA, repeated, or irrelevant text variables. This may affect model fit, as the remaining variables may not be comprehensive enough. Thus, the web scraping method could be optimised to obtain a more accurate data set.


References

Airbnb, 2019. Fast Facts. Available online: https://press.airbnb.com/fast-facts/ (accessed on 11 June 2019).

Belarmino, A., Whalen, E., Koh, Y. and Bowen, J., 2017. Comparing guests’ key attributes of peer-to-peer accommodations and hotels: mixed-methods approach. Current Issues in Tourism, 22(1), pp. 1-7.

Hoaglin, D. and Tukey, J., 2000. Understanding robust and exploratory data analysis. New York: John Wiley & Sons.

Inside Airbnb, 2021. Adding data to the debates. Available online: http://insideairbnb.com/get-the-data.html (accessed on 1 March 2021).

Li, J. and Biljecki, F., 2019. The implementation of big data analysis in regulating online short-term rental business: a case of Airbnb in Beijing. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, IV-4/W9, pp. 79-86.

Voltes-Dorta, A. and Sánchez-Medina, A., 2020. Drivers of Airbnb prices according to property/room type, season and location: A regression approach. Journal of Hospitality and Tourism Management, 45, pp. 266-275.

Zhang, Z. and Fu, R., 2020. Accommodation Experience in the Sharing Economy: A Comparative Study of Airbnb Online Reviews. Sustainability, 12(24), p. 10500.